Introduction

We listen to music in many ways: emotionally, analytically (for production and instrumental technique), and narratively (for a story). Given all of these factors, what makes a good song? This project will try to predict music popularity based on Spotify audio features (AFTS), using historical data from the Recording Academy's Grammy Awards and the Billboard Top Charts. As detailed below (and further in the codebook), AFTS capture the different ways we listen to music. We will run (1) a classification analysis on the Grammy data and (2) a regression analysis on the Billboard data.

Why Use Grammy and Billboard Data?

Note: See the Data Collection section for a more detailed explanation of how these records were extracted. The Recording Academy’s Grammy Awards are the most esteemed music awards and, to date, the only peer-recognized awards in the industry. Collected for this project are the historical “Song of the Year” nominations and winners from 1960-2021. The Billboard charts are another measure of music popularity, based on sales, radio play, and streaming; in this data set, we have the historical annual Top 100 Year-End charts from 1960-2021. We will try to predict two things: (1) what levels of AFTS are associated with Grammy nominations and wins for Song of the Year, and (2) what levels of AFTS are associated with Billboard position (1 to 100, 1 being the most popular). Between these two measurements, the Grammy and Billboard data capture both a high-level critical view and the public consensus on music popularity.

See the most recent Grammy Awards Nominations/Winners (2021)

See the most recent Billboard Year-End Top Charts (2021)

Spotify Audio Features

Note: See the Data Collection section for a more detailed explanation of how these records were extracted. The Spotify API allows you to extract several audio features for each song in the Spotify database; these values are based on a machine-learning audio analysis process. For this project we will use the following selected AFTS: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence. For a detailed description of each of these variables, please refer to the codebook in my zipped files and/or this link. Together, these characteristics capture the ways we listen to music (analytically, narratively, emotionally). Please visit the Spotify Developers documentation to learn more about the Spotify API.
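As a rough illustration of the feature extraction (this is a sketch, not the project's actual collect.py, and `AFTS_KEYS`/`extract_afts` are hypothetical names), the step of keeping only the nine selected AFTS from a Spotify audio-features response could look like this in Python:

```python
# Sketch only: select the nine AFTS used in this project from a
# Spotify audio-features response (a dict keyed by feature name).
# AFTS_KEYS and extract_afts are illustrative names, not project code.

AFTS_KEYS = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "loudness", "speechiness", "tempo", "valence",
]

def extract_afts(features: dict) -> dict:
    """Keep only the selected audio features; missing keys become None."""
    return {key: features.get(key) for key in AFTS_KEYS}

# Example response fragment (values invented for illustration);
# extra API fields such as duration_ms are simply dropped.
response = {
    "acousticness": 0.591, "danceability": 0.326, "energy": 0.326,
    "instrumentalness": 0.918, "liveness": 0.494, "loudness": -15.144,
    "speechiness": 0.0297, "tempo": 186.232, "valence": 0.87,
    "duration_ms": 145000,
}
afts = extract_afts(response)
```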

Data Collection and Web Scraper

The data collection for this project required a few steps. The first was to create a data frame that combined each year of Grammy nominations/winners and Billboard rankings. I found annual CSVs for each year from 1960 to 2021 sourced through Wikipedia. For the Grammy data I manually added the win variable and the year; for the Billboard data, position was already a column, so I only added the year. At this point I had two data sets: one for the Grammy data and one for the Billboard data.

I then wrote a Python function that merged these two data sets (please see the manage.py file in my zipped files for details). It matched or appended the Grammy data to the Billboard data, so that each row had a song name, artist name, position ranking, and Grammy status. Songs that were nominated but did not appear on the Billboard charts were assigned a ranking of 101. I also added a decade variable for EDA purposes. This preliminary data set had around 6300 observations.

Next, I created a Python app that connected to the Spotify API (please see the collect.py file in my zipped files). This collection process took a couple of weeks due to run time and the number of API calls; my computer did not enjoy processing everything. Essentially, the app took each row’s song name and artist name from the preliminary data set, queried the Spotify database for AFTS, and, if the song was found, filled in the AFTS values in the respective columns. Since some names were spelled incorrectly on Wikipedia, I had to double-check this data frame: in this final step, I manually checked the missing data and filled in AFTS where I could. However, about twenty songs still could not be found on Spotify, and I decided to drop these observations because AFTS are essential.
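The merge step described above can be sketched as follows. This is a hedged Python illustration of the matching logic, not the actual manage.py, and the 0/1/2 coding of win here is illustrative: Grammy records are matched to Billboard rows on song and artist name, and nominees absent from the chart are appended with position 101.

```python
# Sketch of the merge logic described above (illustrative, not manage.py).
# Each record is a dict; matching is on (songname, artistname).

def merge_grammy_billboard(billboard_rows, grammy_rows):
    """Attach Grammy status to Billboard rows; append unmatched nominees at 101."""
    merged = [dict(row, win=0) for row in billboard_rows]  # 0 = no distinction (assumed coding)
    index = {(r["songname"], r["artistname"]): r for r in merged}
    for g in grammy_rows:
        key = (g["songname"], g["artistname"])
        if key in index:
            index[key]["win"] = g["win"]           # matched: copy Grammy status
        else:                                       # nominee not on the chart
            merged.append(dict(g, position=101))
    return merged

billboard = [{"songname": "A", "artistname": "X", "position": 1, "year": 1960}]
grammys = [
    {"songname": "A", "artistname": "X", "year": 1960, "win": 2},  # won
    {"songname": "B", "artistname": "Y", "year": 1960, "win": 1},  # nominated
]
rows = merge_grammy_billboard(billboard, grammys)
```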

Exploratory Data Analysis

Data Frame Dimensions

dim(data)
## [1] 6291   15

There are 6291 rows and 15 columns. Each observation is a song.

A brief description of some of the variables is as follows:

  • songname : Name of the song

  • artistname : Artist name(s) of the song

  • year: year of rating/award

  • decade : decade of rating/award

  • position: position on the Billboard chart (1 to 100, 1 being best; 101 if not on chart)

  • win: indicator for Grammy Song of the Year distinction (0 = otherwise, 1 = nominated, 2 = won)

  • audio feature variables (AFTS): Spotify rated audio feature variables, see codebook for specifics

    • acousticness: how acoustic the song is, rating 0.00-1.00, 1.00 being most acoustic
    • danceability: how suitable the song is for dancing, rating 0.00-1.00, 1.00 being most danceable
    • energy: how intense, noisy, loud, and fast the song is, rating 0.00-1.00, 1.00 being most energetic
    • instrumentalness: predicts whether a track contains mostly instruments (no vocals), rating 0.00-1.00, 1.00 being mostly instrumental
    • liveness: detects the presence of a live audience, rating 0.00-1.00, higher values meaning the track was more likely performed live
    • loudness: how loud the song is on average, measured in decibels
    • speechiness: how spoken the song is, rating 0.00-1.00, 1.00 being mostly spoken
    • tempo: the song’s estimated average tempo, measured in beats per minute
    • valence: how happy/positive the song sounds, rating 0.00-1.00, 1.00 being most positive

See the codebook in the zipped files for more details on the audio features and the other variables.

Data Cleaning

Due to the web scraping process, the data frame is fairly tidy. Here we will convert nominal variables to factors, make any level adjustments, and clean up any other aspects of the data that will improve the efficiency of the analysis.

  • convert the year, decade, and win variables to factors

And here are the first 50 rows of the data set!
data$year <- as.factor(data$year)
data$win <- as.factor(data$win)
data$decade <- as.factor(data$decade)
levels(data$win) <- c("none", "nominated", "won")
levels(data$decade) <- c("60s", "70s", "80s", "90s", "2000s", "2010s", "2020s")

music_hist <- data
head(music_hist, n = 50) %>%
  kable %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = '700px', height = "300px")
songname artistname year decade position win acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence
Theme from A Summer Place Percy Faith 1960 60s 1 won 0.5910 0.326 0.3260 9.18e-01 0.4940 -15.144 0.0297 186.232 0.870
He’ll Have to Go Jim Reeves 1960 60s 2 none 0.9090 0.554 0.1860 1.44e-03 0.1100 -15.846 0.0379 81.181 0.200
Cathy’s Clown The Everly Brothers 1960 60s 3 none 0.5480 0.507 0.5610 0.00e+00 0.5900 -10.472 0.0353 120.042 0.890
Running Bear Johnny Preston 1960 60s 4 none 0.7750 0.760 0.4680 2.36e-05 0.1840 -8.957 0.0482 119.986 0.745
Teen Angel Mark Dinning 1960 60s 5 none 0.9360 0.569 0.0638 0.00e+00 0.1220 -18.548 0.0458 101.521 0.282
I’m Sorry Brenda Lee 1960 60s 6 none 0.8680 0.558 0.2230 9.72e-04 0.1300 -12.362 0.0299 101.711 0.303
It’s Now or Never Elvis Presley 1960 60s 7 none 0.6420 0.643 0.4910 9.72e-03 0.2860 -9.312 0.0344 126.399 0.753
Handy Man Jimmy Jones 1960 60s 8 none 0.4000 0.477 0.7810 0.00e+00 0.3310 -6.931 0.0654 144.976 0.750
Stuck on You Elvis Presley 1960 60s 9 none 0.7580 0.647 0.5130 8.60e-06 0.1080 -12.372 0.0421 131.641 0.955
The Twist Chubby Checker 1960 60s 10 none 0.1800 0.571 0.6180 2.70e-06 0.1780 -5.682 0.0371 156.892 0.852
Everybody’s Somebody’s Fool Connie Francis 1960 60s 11 none 0.7450 0.584 0.7370 1.93e-05 0.2080 -4.643 0.0306 84.322 0.753
Wild One Bobby Rydell 1960 60s 12 none 0.6790 0.598 0.7140 3.00e-06 0.0539 -6.981 0.0443 148.876 0.853
Greenfields The Brothers Four 1960 60s 13 none 0.8630 0.466 0.1250 0.00e+00 0.1210 -20.428 0.0372 111.430 0.338
What in the World’s Come Over You Jack Scott 1960 60s 14 none 0.8020 0.506 0.1760 3.90e-06 0.0978 -13.962 0.0303 76.273 0.492
El Paso Marty Robbins 1960 60s 15 none 0.8350 0.654 0.4520 2.89e-05 0.1600 -9.709 0.0300 106.662 0.691
Alley Oop The Hollywood Argyles 1960 60s 16 none 0.8090 0.584 0.4760 0.00e+00 0.0915 -9.562 0.0614 63.325 0.944
My Heart Has a Mind of Its Own Connie Francis 1960 60s 17 none 0.8190 0.495 0.5390 0.00e+00 0.3250 -6.088 0.0336 109.490 0.715
Sweet Nothin’s Brenda Lee 1960 60s 18 none 0.6420 0.778 0.4130 1.04e-03 0.1480 -10.551 0.0514 125.235 0.961
Itsy Bitsy Teenie Weenie Yellow Polka Dot Bikini Brian Hyland 1960 60s 19 none 0.5580 0.814 0.4270 0.00e+00 0.0248 -11.543 0.0918 123.109 0.964
Only the Lonely Roy Orbison 1960 60s 20 none 0.3770 0.570 0.5290 5.09e-03 0.2030 -10.769 0.0280 123.273 0.934
Where or When Dion and the Belmonts 1960 60s 21 none 0.7220 0.449 0.3950 0.00e+00 0.1830 -6.389 0.0277 110.501 0.393
Sixteen Reasons Connie Stevens 1960 60s 22 none 0.8650 0.339 0.4060 1.10e-05 0.1100 -8.955 0.0319 109.783 0.619
Puppy Love Paul Anka 1960 60s 23 none 0.6670 0.431 0.3210 0.00e+00 0.1920 -11.827 0.0289 103.164 0.518
Why Frankie Avalon 1960 60s 24 none 0.7620 0.510 0.3490 0.00e+00 0.1300 -8.677 0.0264 94.267 0.586
Walk Don’t Run The Ventures 1960 60s 25 none 0.8520 0.488 0.6480 9.14e-01 0.1300 -13.252 0.0305 156.350 0.949
Save the Last Dance for Me The Drifters 1960 60s 26 none 0.6140 0.540 0.5300 0.00e+00 0.1980 -10.583 0.0361 143.453 0.896
Baby (You’ve Got What It Takes) Dinah Washington 1960 60s 27 none 0.8520 0.670 0.5960 2.03e-03 0.6530 -9.347 0.0627 133.396 0.813
Sink the Bismarck Johnny Horton 1960 60s 28 none 0.6520 0.680 0.5700 1.18e-05 0.0565 -12.388 0.0958 115.894 0.966
Chain Gang Sam Cooke 1960 60s 29 none 0.7300 0.703 0.7240 0.00e+00 0.5180 -10.818 0.0467 131.821 0.963
Let It Be Me The Everly Brothers 1960 60s 30 none 0.7700 0.471 0.1900 3.85e-03 0.1290 -16.046 0.0280 72.764 0.305
Good Timin’ Jimmy Jones 1960 60s 31 none 0.6090 0.552 0.5620 1.25e-04 0.0930 -7.682 0.0412 147.384 0.971
Beyond the Sea Bobby Darin 1960 60s 32 none 0.7230 0.521 0.5160 0.00e+00 0.2570 -7.456 0.0369 136.483 0.569
Go, Jimmy, Go Jimmy Clanton 1960 60s 33 none 0.2420 0.487 0.6670 0.00e+00 0.1050 -4.783 0.0358 137.492 0.782
Night Jackie Wilson 1960 60s 34 none 0.9180 0.284 0.4980 9.70e-04 0.3610 -5.534 0.0296 99.075 0.327
Burning Bridges Jack Scott 1960 60s 35 none 0.8000 0.471 0.2500 3.23e-04 0.1240 -14.637 0.0246 82.647 0.232
The Big Hurt Toni Fisher 1960 60s 36 none 0.8750 0.511 0.5240 0.00e+00 0.1220 -14.652 0.0395 127.444 0.284
Because They’re Young Duane Eddy 1960 60s 37 none 0.0122 0.701 0.6230 7.43e-01 0.0858 -10.508 0.0332 120.567 0.848
Lonely Blue Boy Conway Twitty 1960 60s 38 none 0.7770 0.659 0.3420 1.26e-05 0.2480 -13.709 0.0353 112.878 0.677
Pretty Blue Eyes Steve Lawrence 1960 60s 39 none 0.4970 0.477 0.5070 0.00e+00 0.3600 -11.802 0.0328 126.419 0.864
Way Down Yonder in New Orleans Freddy Cannon 1960 60s 40 none 0.7490 0.453 0.7140 2.00e-06 0.1240 -9.083 0.0475 142.737 0.881
Paper Roses Anita Bryant 1960 60s 41 none 0.9020 0.350 0.1240 2.07e-05 0.1340 -18.923 0.0339 117.169 0.320
Mr. Custer Larry Verne 1960 60s 42 none 0.7070 0.697 0.6970 8.72e-05 0.1560 -4.863 0.0537 108.500 0.582
I Want to Be Wanted Brenda Lee 1960 60s 43 none 0.5930 0.515 0.4120 0.00e+00 0.2180 -6.322 0.0277 107.610 0.392
Mule Skinner Blues The Fendermen 1960 60s 44 none 0.8140 0.510 0.5920 0.00e+00 0.0933 -6.387 0.0384 128.880 0.545
Cradle of Love Johnny Preston 1960 60s 45 none 0.2720 0.450 0.5920 1.81e-02 0.0852 -6.740 0.0760 173.986 0.796
You Got What It Takes Marv Johnson 1960 60s 46 none 0.7140 0.740 0.7050 2.82e-05 0.0956 -7.994 0.0460 130.348 0.960
Please Help Me, I’m Falling Hank Locklin 1960 60s 47 none 0.8280 0.552 0.3360 0.00e+00 0.3240 -10.712 0.0311 105.479 0.553
Love You So Ron Holden 1960 60s 48 none 0.8900 0.634 0.2750 1.11e-03 0.1370 -11.417 0.0348 127.294 0.581
Finger Poppin’ Time Hank Ballard & The Midnighters 1960 60s 49 none 0.0735 0.533 0.7590 0.00e+00 0.3150 -5.199 0.0356 159.373 0.889
Harbor Lights The Platters 1960 60s 50 none 0.9130 0.260 0.2900 6.49e-04 0.1530 -13.380 0.0293 78.797 0.303

During the manual check of the data, I removed songs that were not in the Spotify database, which should have eliminated any missing values. Let’s double-check for NA values and, if any are present, drop them.

  • check for NA values
apply(data, 2, function(x) any(is.na(x)))
##         songname       artistname             year           decade 
##            FALSE            FALSE            FALSE            FALSE 
##         position              win     acousticness     danceability 
##            FALSE            FALSE            FALSE            FALSE 
##           energy instrumentalness         liveness         loudness 
##            FALSE            FALSE            FALSE            FALSE 
##      speechiness            tempo          valence 
##            FALSE            FALSE            FALSE

There are no missing values. We are ready to investigate how our variables interact with each other.

Audio Features Summary

Let’s pivot the data from wide to long so that all the AFTS names fall under one variable, aft, with their values in rating. This will make it easier to build graphics during the EDA. Here are the first 50 rows of that reshaped data set.

#pivoted music_hist frame for further investigation
data_mod <- music_hist %>%
  pivot_longer(
    cols = c("acousticness", "danceability", "energy", "instrumentalness",
             "liveness", "loudness", "speechiness", "tempo", "valence"),
    names_to = "aft", 
    values_to = "rating"
  )
head(data_mod, n = 50) %>%
  kable %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = '700px', height = "300px")
songname artistname year decade position win aft rating
Theme from A Summer Place Percy Faith 1960 60s 1 won acousticness 0.5910000
Theme from A Summer Place Percy Faith 1960 60s 1 won danceability 0.3260000
Theme from A Summer Place Percy Faith 1960 60s 1 won energy 0.3260000
Theme from A Summer Place Percy Faith 1960 60s 1 won instrumentalness 0.9180000
Theme from A Summer Place Percy Faith 1960 60s 1 won liveness 0.4940000
Theme from A Summer Place Percy Faith 1960 60s 1 won loudness -15.1440000
Theme from A Summer Place Percy Faith 1960 60s 1 won speechiness 0.0297000
Theme from A Summer Place Percy Faith 1960 60s 1 won tempo 186.2320000
Theme from A Summer Place Percy Faith 1960 60s 1 won valence 0.8700000
He’ll Have to Go Jim Reeves 1960 60s 2 none acousticness 0.9090000
He’ll Have to Go Jim Reeves 1960 60s 2 none danceability 0.5540000
He’ll Have to Go Jim Reeves 1960 60s 2 none energy 0.1860000
He’ll Have to Go Jim Reeves 1960 60s 2 none instrumentalness 0.0014400
He’ll Have to Go Jim Reeves 1960 60s 2 none liveness 0.1100000
He’ll Have to Go Jim Reeves 1960 60s 2 none loudness -15.8460000
He’ll Have to Go Jim Reeves 1960 60s 2 none speechiness 0.0379000
He’ll Have to Go Jim Reeves 1960 60s 2 none tempo 81.1810000
He’ll Have to Go Jim Reeves 1960 60s 2 none valence 0.2000000
Cathy’s Clown The Everly Brothers 1960 60s 3 none acousticness 0.5480000
Cathy’s Clown The Everly Brothers 1960 60s 3 none danceability 0.5070000
Cathy’s Clown The Everly Brothers 1960 60s 3 none energy 0.5610000
Cathy’s Clown The Everly Brothers 1960 60s 3 none instrumentalness 0.0000000
Cathy’s Clown The Everly Brothers 1960 60s 3 none liveness 0.5900000
Cathy’s Clown The Everly Brothers 1960 60s 3 none loudness -10.4720000
Cathy’s Clown The Everly Brothers 1960 60s 3 none speechiness 0.0353000
Cathy’s Clown The Everly Brothers 1960 60s 3 none tempo 120.0420000
Cathy’s Clown The Everly Brothers 1960 60s 3 none valence 0.8900000
Running Bear Johnny Preston 1960 60s 4 none acousticness 0.7750000
Running Bear Johnny Preston 1960 60s 4 none danceability 0.7600000
Running Bear Johnny Preston 1960 60s 4 none energy 0.4680000
Running Bear Johnny Preston 1960 60s 4 none instrumentalness 0.0000236
Running Bear Johnny Preston 1960 60s 4 none liveness 0.1840000
Running Bear Johnny Preston 1960 60s 4 none loudness -8.9570000
Running Bear Johnny Preston 1960 60s 4 none speechiness 0.0482000
Running Bear Johnny Preston 1960 60s 4 none tempo 119.9860000
Running Bear Johnny Preston 1960 60s 4 none valence 0.7450000
Teen Angel Mark Dinning 1960 60s 5 none acousticness 0.9360000
Teen Angel Mark Dinning 1960 60s 5 none danceability 0.5690000
Teen Angel Mark Dinning 1960 60s 5 none energy 0.0638000
Teen Angel Mark Dinning 1960 60s 5 none instrumentalness 0.0000000
Teen Angel Mark Dinning 1960 60s 5 none liveness 0.1220000
Teen Angel Mark Dinning 1960 60s 5 none loudness -18.5480000
Teen Angel Mark Dinning 1960 60s 5 none speechiness 0.0458000
Teen Angel Mark Dinning 1960 60s 5 none tempo 101.5210000
Teen Angel Mark Dinning 1960 60s 5 none valence 0.2820000
I’m Sorry Brenda Lee 1960 60s 6 none acousticness 0.8680000
I’m Sorry Brenda Lee 1960 60s 6 none danceability 0.5580000
I’m Sorry Brenda Lee 1960 60s 6 none energy 0.2230000
I’m Sorry Brenda Lee 1960 60s 6 none instrumentalness 0.0009720
I’m Sorry Brenda Lee 1960 60s 6 none liveness 0.1300000
  • Audio feature averages for the complete data set
aft_avg <- data_mod %>% group_by(aft) %>% 
  summarise_at(vars(rating), list(rating = mean))
aft_avg %>%
  kable %>%
  kable_styling("striped", full_width = F) 
aft rating
acousticness 0.2646085
danceability 0.6259299
energy 0.6149965
instrumentalness 0.0374943
liveness 0.1759030
loudness -8.4695172
speechiness 0.0724138
tempo 119.4077783
valence 0.5988204
  • Audio Feature Summary

Here is a quick glimpse at the complete-data-frame averages of the AFTS rated 0.00-1.00 (loudness and tempo are excluded, since they are measured on different scales).
aft_hist <- ggplot(data = aft_avg[-c(6, 8), ],  # drop loudness and tempo (different scales)
                   aes(x = reorder(aft, -rating), y = rating, fill = aft)) +
  geom_bar(stat = "identity") +
  xlab("Audio Feature") +
  ylab("Spotify Rating (0.00-1.00)")
aft_hist

Here we will examine how the audio features interact with each other.

  • Correlation matrix of audio features

music_hist[-c(4)] %>%
  dplyr::select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type="lower", diag= FALSE, method = 'color')

Decades Summary of Audio Features

Music has changed drastically over time, so audio feature trends may have fluctuated as well. Let’s explore a decade-by-decade comparison of the audio feature averages. It is important to consider whether year is an influential variable that should be controlled for.

#Decades-Audio Feature Table:
options(dplyr.summarise.inform = FALSE)

dec_avg <- data_mod[-c(1,2,3,5,6)] %>%
  group_by(decade,aft) %>%
  dplyr::summarise(mean = mean(rating))

dec_table <- data.frame(dec_avg)
dec_table <- reshape2::dcast(dec_avg, decade ~ aft)
## Using mean as value column: use value.var to override.
dec_table %>%
  kable %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = '700px', height = "300px")
decade acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence
60s 0.5050905 0.5458441 0.5167829 0.0627306 0.1974617 -10.339513 0.0495471 120.3426 0.6745975
70s 0.3464703 0.5871392 0.5692609 0.0483917 0.1830250 -10.503112 0.0495957 118.4017 0.6640022
80s 0.2324466 0.6337428 0.6250042 0.0393305 0.1606487 -9.865360 0.0441348 120.5321 0.6389495
90s 0.2067170 0.6483998 0.6183073 0.0320656 0.1706651 -8.714538 0.0729747 116.8630 0.5607160
2000s 0.1413434 0.6635125 0.7014338 0.0085183 0.1740826 -5.714801 0.1085999 118.4128 0.5720853
2010s 0.1601512 0.6676283 0.6622965 0.0224240 0.1666712 -6.010614 0.0988537 122.2088 0.4998555
2020s 0.2508438 0.6729615 0.5793221 0.1008203 0.1858822 -7.174293 0.1181976 119.5157 0.4830072


  • Boxplots of each audio feature by decade:
acous_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'acousticness')
acous_dec <- ggplot(acous_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

dance_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'danceability')
dance_dec <- ggplot(dance_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

energy_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'energy')
energy_dec <- ggplot(energy_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

instrum_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'instrumentalness')
instrum_dec <- ggplot(instrum_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

live_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'liveness')
live_dec <- ggplot(live_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

loud_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'loudness')
loud_dec <- ggplot(loud_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

speech_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'speechiness')
speech_dec <- ggplot(speech_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

tempo_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'tempo')
tempo_dec <- ggplot(tempo_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

val_df <- subset(data_mod[-c(1,2,3,5,6)], aft == 'valence')
val_dec <- ggplot(val_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.position="none")

dummy_dec <- ggplot(val_df) +
  geom_boxplot(aes(x=aft, y=rating, color = decade))+
  theme(legend.direction = "vertical")
leg <- cowplot::get_legend(dummy_dec) 

r1 <- plot_grid(acous_dec, dance_dec)
r2 <- plot_grid(energy_dec, instrum_dec)
r3 <- plot_grid(live_dec, speech_dec)
r4 <- plot_grid(val_dec, tempo_dec)
r5 <- plot_grid(leg, loud_dec)
r5

plot_grid(r1,r2, nrow = 2)

plot_grid(r3,r4, nrow = 2)

These boxplots show some obvious changes in the decade averages for some AFTS. Over time, songs have certainly become louder; I assume this is due to technology. Songs have also become less acoustic, probably due to the new wave of electronic technology and instrumentation. Valence (the happiness of a song) seems to be decreasing, which is an interesting feature to note. The other AFTS don’t seem to show much change or any particular trend over time. Overall, while some AFTS on average don’t change much, some certainly do. We mark music by decade: the 60s are rock ’n’ roll, the 70s disco, the 80s are… ambiguous, and so on. As we can see, some AFTS have distinct decade features, and this should be considered when building the models.

Billboard Summary of Audio Features

Let’s explore how ranking on the Billboard Year-End chart relates to the AFTS on average. It would be interesting to see whether there are any obvious trends in what puts a song near the top 10.

bill<-data_mod %>%
  dplyr::select(-c(1,2,3,4,6)) %>%
  group_by(position, aft) %>%
  summarise_at(vars(rating), list(rating = mean))

acous_b <- bill[bill$aft %in% "acousticness", ]
b1<- ggplot(acous_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "acousticness")

dance_b <- bill[bill$aft %in% "danceability", ]
b2 <- ggplot(dance_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "danceability")

energy_b <- bill[bill$aft %in% "energy", ]
b3 <- ggplot(energy_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm)+ 
  labs(title = "energy")

instrum_b <- bill[bill$aft %in% "instrumentalness", ]
b4 <- ggplot(instrum_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "instrumentalness")

live_b <- bill[bill$aft %in% "liveness", ]
b5 <- ggplot(live_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "liveness")

speech_b <- bill[bill$aft %in% "speechiness", ]
b6 <-ggplot(speech_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "speechiness")

val_b <- bill[bill$aft %in% "valence", ]
b7 <- ggplot(val_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "valence")

tempo_b <- bill[bill$aft %in% "tempo",]
b8 <- ggplot(tempo_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "tempo")

loud_b <- bill[bill$aft %in% "loudness",]
b9 <- ggplot(loud_b, aes(x = position, y = rating)) +
  geom_line() +
  geom_smooth(method = lm) + 
  labs(title = "loudness")

plot_grid(b1,b2,b3,b4,b5,b6,b7,b8,b9, nrow = 3)
## `geom_smooth()` using formula 'y ~ x'


There are no very noticeable trends between ranking on the Billboard Top Charts and the AFTS. Most noticeable are the negative relationships of danceability and loudness with ranking, so perhaps more popular songs tend to be more danceable and louder on average.

Winners and Nominations Summary

Next, let’s take a look at some density histograms of the AFTS, grouped by Grammy status. Here we can see at a glance whether the distribution of each audio feature differs by status. This may give insight as to whether we can expect the audio features to be good predictors for the Grammy’s Song of the Year.

gram<-data_mod %>%
  dplyr::select(-c(1,2,3,4,5)) 

acous_g <- gram[gram$aft %in% "acousticness", ]
g1 <- ggplot(acous_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none")  +
  labs(title = "acousticness")

dance_g <- gram[gram$aft %in% "danceability", ]
g2 <- ggplot(dance_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title ="danceability")

energy_g <- gram[gram$aft %in% "energy", ]
g3 <- ggplot(energy_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title ="energy")

instrum_g <- gram[gram$aft %in% "instrumentalness", ]
g4 <- ggplot(instrum_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title ="instrumentalness")

live_g <- gram[gram$aft %in% "liveness", ]
g5 <- ggplot(live_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title ="liveness")

speech_g <- gram[gram$aft %in% "speechiness", ]
g6 <- ggplot(speech_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title = "speechiness")

val_g <- gram[gram$aft %in% "valence", ]
g7 <- ggplot(val_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title = "valence")

tempo_g <- gram[gram$aft %in% "tempo", ]
g8 <- ggplot(tempo_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title = "tempo")

loud_g <- gram[gram$aft %in% "loudness", ]
g9 <- ggplot(loud_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) +
  theme(legend.position="none") +
  labs(title = "loudness")

gdummy <- ggplot(acous_g, aes(x=rating, group=win, fill = win)) +
  geom_density(adjust = 2, alpha = 0.4) 
  
leg2 <- cowplot::get_legend(gdummy)

x1 <- plot_grid(leg2,g9)
x2 <- plot_grid(g1,g2)
x3 <- plot_grid(g3,g4)
x4 <- plot_grid(g5,g6)
x5 <- plot_grid(g7,g8)
x1

plot_grid(x2,x3, nrow = 2)

plot_grid(x4,x5, nrow = 2)


The AFTS each have very similar shapes whether the song won, was nominated, or neither for Song of the Year. Thus it may be fair to expect that our models will have a hard time predicting the categories. That said, let’s check out a few classification models and see whether AFTS are good predictors of Grammy Song of the Year status.

Grammy Nominations & Winners: Classification

Our first set of models will try to predict Grammy Song of the Year status. We will use the following protocol:

  • Split the data, stratifying on the variable ‘win’ (Grammy Song of the Year status)
  • Fold the data for cross validation (using 10 folds)
  • Create a recipe predicting ‘win’ using year, all the AFTS, and some interaction terms
  • Fit 4 models (Random Forest, Nearest Neighbors, SVM, Lasso)
  • Find the best fit of the four models based on the ROC AUC metric
  • Build the final model
  • Analyze the test data based on the final model

Data Splitting and Cross-Validation

Let’s begin by splitting our data. We will stratify the training and test sets on the variable ‘win’. Since there is only one Song of the Year winner per year, we want to make sure the winners are distributed evenly in our split.

set.seed(1027)
grammys_split <- music_hist %>%
  initial_split(prop = .8, strata = "win")
grammys_train <- training(grammys_split)
grammys_test <- testing(grammys_split)
dim(grammys_train)
## [1] 5032   15
dim(grammys_test)
## [1] 1259   15
  • Fold training data into 10 folds
grammys_folds <- vfold_cv(grammys_train, strata = "win", v = 10)

Recipe

grammys_recipe <- recipe(win ~ year + acousticness + danceability + # predict win using year and AFTS
                           energy + instrumentalness + liveness + 
                           loudness + speechiness + tempo + valence,
                         grammys_train) %>% 
  step_dummy(all_nominal_predictors()) %>%  # make sure all nominal variables are noted accordingly
  step_interact(~energy:acousticness + danceability:acousticness + # created interactions based on the most correlated
                  valence:danceability + energy:loudness +        # AFTS from the correlation matrix
                  loudness:acousticness) %>%
  step_center(all_predictors()) %>%  # Center and scale our variables
  step_scale(all_predictors()) 

Model Fitting

Using the recipe above I will fit the following 4 models:

RANDOM FOREST

In this random forest model we will use the ranger engine, set importance to impurity, set the mode to classification, and tune mtry, trees, and min_n. Next I set up a tuning grid with ranges for the tuned hyperparameters. Finally, I saved the tuned and fitted model.

# Specs and workflow
rf_spec_g <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") 

rf_wf_g <- workflow() %>%
  add_model(rf_spec_g %>% 
              set_args(mtry = tune(),
                       trees = tune(),
                       min_n = tune())) %>%
  add_recipe(grammys_recipe) 

# Tuning grid
param_grid_rf_g <- grid_regular(mtry(range = c(1,10)), trees(range= c(1,5)),
                               min_n(range = c(3,10)), levels = 10)

tune_rf_g <- tune_grid(
  rf_wf_g, 
  resamples = grammys_folds, 
  grid = param_grid_rf_g)

# save model (to avoid refitting later)
save(tune_rf_g, rf_wf_g, file = "data/model_fitting/tune_rf_g.rda")

NEAREST NEIGHBORS

In this nearest neighbors model we will use the kknn engine, set the mode to classification, and tune neighbors. Next I set up a tuning grid for the tuned neighbors hyperparameter. Finally, I saved the tuned and fitted model.

knn_spec_g <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_wf_g <- workflow() %>%
  add_model(knn_spec_g %>% set_args(neighbors = tune())) %>%
  add_recipe(grammys_recipe)

# Tuning grid
param_grid_knn_g <- grid_regular(neighbors(), levels = 10)

tune_knn_g <- tune_grid(
  knn_wf_g, 
  resamples = grammys_folds, 
  grid = param_grid_knn_g)

# save model (to avoid refitting later)
save(tune_knn_g, knn_wf_g, file = "data/model_fitting/tune_knn_g.rda")

SVM

In this SVM we will use the kernlab engine, set the mode to classification, and tune cost. Next I set up a tuning grid with ranges for the cost hyperparameter. Finally, I saved the tuning results to avoid refitting later.

# Specs and workflow
svm_spec_g <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_wf_g <- workflow() %>%
  add_model(svm_spec_g %>% set_args(cost = tune())) %>%
  add_recipe(grammys_recipe)

# Tuning grid
param_grid_svm_g <- grid_regular(cost(), levels = 10)

tune_svm_g <- tune_grid(
  svm_wf_g, 
  resamples = grammys_folds, 
  grid = param_grid_svm_g)

# Save model (to avoid refitting later)
save(tune_svm_g, svm_wf_g, file = "data/model_fitting/tune_svm_g.rda")

LASSO REGRESSION

In this Lasso model we will use the glmnet engine, set mixture to 1 to indicate Lasso regularization, set the mode to classification, and tune penalty. Next I set up a tuning grid with ranges for the penalty hyperparameter. Finally, I saved the tuning results to avoid refitting later.

# Specs and Workflow
lasso_spec_g <- 
  multinom_reg(penalty = tune(), mixture = 1) %>% 
  set_mode("classification") %>%
  set_engine("glmnet")

lasso_wf_g<- workflow() %>% 
  add_recipe(grammys_recipe) %>% 
  add_model(lasso_spec_g)

# Tuning grid
penalty_grid_lasso_g <- grid_regular(penalty(range = c(-10, 10)), levels = 10)

tune_lasso_g <- tune_grid(
  lasso_wf_g,
  resamples = grammys_folds, 
  grid = penalty_grid_lasso_g
)

# Save model (to avoid refitting later)
save(tune_lasso_g, lasso_wf_g, file = "data/model_fitting/tune_lasso_g.rda")

Model Selection and Performance

Since we saved our models to avoid refitting, we must load them in the following steps.

load("data/model_fitting/tune_rf_g.rda")
load("data/model_fitting/tune_knn_g.rda")
load("data/model_fitting/tune_svm_g.rda")
load("data/model_fitting/tune_lasso_g.rda")

Random Forest Model

autoplot(tune_rf_g, metric = 'roc_auc')

show_best(tune_rf_g, metric = "roc_auc") 
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1     2     3 roc_auc hand_till  0.550    10  0.0151 Preprocessor1_Model0…
## 2     3     2     6 roc_auc hand_till  0.546    10  0.0266 Preprocessor1_Model1…
## 3     2     2     5 roc_auc hand_till  0.541    10  0.0180 Preprocessor1_Model1…
## 4     9     3     8 roc_auc hand_till  0.540    10  0.0193 Preprocessor1_Model2…
## 5     2     5    10 roc_auc hand_till  0.539    10  0.0214 Preprocessor1_Model3…

From show_best(), the highest mean AUC is 0.5499734, where mtry is 1, trees is 2, and min_n is 3. An AUC of ~0.55 means the model ranks a randomly chosen positive case above a randomly chosen negative one only about 55% of the time, barely better than chance. This is not great, but from our EDA it is fairly expected.

KKNN Model

autoplot(tune_knn_g, metric = 'roc_auc')

show_best(tune_knn_g, metric = "roc_auc")
## # A tibble: 5 × 7
##   neighbors .metric .estimator  mean     n std_err .config              
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1         5 roc_auc hand_till  0.501    10 0.00623 Preprocessor1_Model05
## 2         1 roc_auc hand_till  0.500    10 0.00384 Preprocessor1_Model01
## 3         6 roc_auc hand_till  0.497    10 0.00620 Preprocessor1_Model06
## 4         2 roc_auc hand_till  0.497    10 0.00415 Preprocessor1_Model02
## 5         3 roc_auc hand_till  0.496    10 0.00472 Preprocessor1_Model03

From show_best(), the highest mean AUC is 0.5008696, where neighbors is 5. An AUC of ~0.50 is no better than random guessing, so this model performs even worse than the RF model selected above.

SVM Model

autoplot(tune_svm_g, metric = 'roc_auc')

show_best(tune_svm_g, metric = "roc_auc")
## # A tibble: 5 × 7
##      cost .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1  3.17   roc_auc hand_till  0.598    10  0.0183 Preprocessor1_Model08
## 2  0.0992 roc_auc hand_till  0.591    10  0.0191 Preprocessor1_Model05
## 3  0.315  roc_auc hand_till  0.584    10  0.0162 Preprocessor1_Model06
## 4 32      roc_auc hand_till  0.584    10  0.0208 Preprocessor1_Model10
## 5 10.1    roc_auc hand_till  0.583    10  0.0154 Preprocessor1_Model09

From show_best(), the highest mean AUC is 0.5978743, where cost is 3.17. An AUC of ~0.60 means the model discriminates between classes noticeably better than chance. That is quite the improvement; so far this is the best model.

Lasso Model

autoplot(tune_lasso_g, metric = 'roc_auc')

show_best(tune_lasso_g, metric = "roc_auc")
## # A tibble: 5 × 7
##         penalty .metric .estimator  mean     n std_err .config              
##           <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        0.0774 roc_auc hand_till    0.5    10       0 Preprocessor1_Model05
## 2       12.9    roc_auc hand_till    0.5    10       0 Preprocessor1_Model06
## 3     2154.     roc_auc hand_till    0.5    10       0 Preprocessor1_Model07
## 4   359381.     roc_auc hand_till    0.5    10       0 Preprocessor1_Model08
## 5 59948425.     roc_auc hand_till    0.5    10       0 Preprocessor1_Model09

From show_best(), the highest mean AUC is 0.5, where penalty is 0.0774. In fact, every candidate penalty yields an AUC of exactly 0.5 with zero standard error, which suggests the Lasso is effectively predicting a single class. This is not the model to choose.

The SVM model performed the best. Let’s use this in our final workflow.

Final Model Building

Our best performing model was the SVM! Next we will create a final workflow with the best SVM using select_best().

svm_wf_tuned_g <- svm_wf_g %>%
  finalize_workflow(select_best(tune_svm_g, metric = 'roc_auc'))
svm_g_fit <- fit(svm_wf_tuned_g, grammys_train)

Analysis of Test Set

We will now fit the finalized model to our test data, and see how it performs!

aug <- augment(svm_g_fit, new_data = grammys_test) 
tbl_g <- aug %>% roc_auc(truth = win, estimate =c(
  .pred_none, .pred_nominated, .pred_won))
roc_curv_g <- aug %>% roc_curve(truth = win, estimate =c(
  .pred_none, .pred_nominated, .pred_won)) %>% autoplot()
final_plot_g <- aug %>% conf_mat(truth = win, estimate =.pred_class) %>%
  autoplot(type = "heatmap")

tbl_g
roc_curv_g
final_plot_g

The model returned an AUC of 0.4931, roughly 0.1 below what we got on the training set, so the SVM did not perform well on the test data. The ROC curves and confusion matrix visualize this: the model essentially assigns everything to the ‘none’ bucket. From our EDA, we saw that the density histograms of the AFTS did not differ much across categories, so there may simply not be a big enough distinction in audio features between nominated, winning, and other songs. The SVM likely defaults to ‘none’ because that class has far more observations in the data set to begin with. While our model does not do a good job predicting anything, this is telling for the Grammy’s: the Song of the Year category appears to have a fairly diverse taste in music, and nominations and winners may not follow a trend in audio features according to the Recording Academy.
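One quick way to sanity-check the class-imbalance explanation is to tabulate the outcome directly. A minimal sketch, assuming the `grammys_train` split from earlier with its factor outcome `win` (levels none/nominated/won):

```r
library(dplyr)

# Tabulate the outcome classes in the training set; a heavy skew toward
# "none" would explain why the SVM defaults to that prediction.
grammys_train %>%
  count(win) %>%
  mutate(prop = n / sum(n))
```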

Billboards Position: Regression

While the Grammy’s categories may not have been distinct enough in AFTS, perhaps the ML models will have better luck with the Billboard rankings. Thus, our next set of models will try to predict the position on the Year-End Billboard charts. Note that we left position as a numerical variable, so some position predictions will be decimals. We will use the following protocol:

  • Split the data, stratifying on the variable ‘position’ (Year-End Billboard’s Charts ranking)
  • Fold the data for cross validation (using 10 folds)
  • Create a recipe predicting ‘position’ using all the AFTS plus some interaction variables
  • Fit 4 models (Random Forest, Nearest Neighbors, SVM, Lasso)
  • Find the best fit of the four models based on RMSE metric and
  • Build the final model
  • Analyze the test data based on the final model

Data Splitting and Cross-Validation

  • The split
set.seed(1027)
billboards_split <- music_hist %>%
  initial_split(prop = .8, strata = "position")
billboards_train <- training(billboards_split)
billboards_test <- testing(billboards_split)
dim(billboards_train)
## [1] 5032   15
dim(billboards_test)
## [1] 1259   15
  • Fold training data into 10 folds
billboards_folds <- vfold_cv(billboards_train, strata = "position", v = 10)

Recipe

billboards_recipe <- recipe(position ~ year + acousticness + danceability + # predict position using AFTS
                           energy + instrumentalness + liveness + 
                           loudness + speechiness + tempo + valence,
                         billboards_train) %>% 
  step_dummy(all_nominal_predictors()) %>% # make sure all nominal variables are noted accordingly
  step_interact(~energy:acousticness + danceability:acousticness + # created interactions based on the most correlated
                  valence:danceability + energy:loudness +         # AFTS from the correlation matrix
                  loudness:acousticness) %>%
  step_center(all_predictors()) %>%  # Center and scale our variables
  step_scale(all_predictors()) 

Model Fitting

Using the recipe above I will fit the following 4 models:

RANDOM FOREST

In this random forest model we will use the ranger engine, set importance to impurity, set the mode to regression, and tune mtry, trees, and min_n. Next I set up a tuning grid with ranges for the tuned hyperparameters. Finally, I saved the tuning results to avoid refitting later.

rf_spec_b <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression") 

rf_wf_b <- workflow() %>%
  add_model(rf_spec_b %>% 
              set_args(mtry = tune(),
                       trees = tune(),
                       min_n = tune())) %>%
  add_recipe(billboards_recipe) 

param_grid_rf_b <- grid_regular(mtry(range = c(1,10)), trees(range= c(1,5)),
                               min_n(range = c(3,10)), levels = 10)

tune_rf_b <- tune_grid(
  rf_wf_b, 
  resamples = billboards_folds, 
  grid = param_grid_rf_b)

save(tune_rf_b, rf_wf_b, file = "data/model_fitting/tune_rf_b.rda")

NEAREST NEIGHBORS

In this nearest neighbors model we will use the kknn engine, set the mode to regression, and tune neighbors. Next I set up a tuning grid over the neighbors hyperparameter. Finally, I saved the tuning results to avoid refitting later.

knn_spec_b <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wf_b <- workflow() %>%
  add_model(knn_spec_b %>% set_args(neighbors = tune())) %>%
  add_recipe(billboards_recipe)

param_grid_knn_b <- grid_regular(neighbors(), levels = 10)

tune_knn_b <- tune_grid(
  knn_wf_b, 
  resamples = billboards_folds, 
  grid = param_grid_knn_b
)

save(tune_knn_b, knn_wf_b, file = "data/model_fitting/tune_knn_b.rda")

SVM

In this SVM we will use the kernlab engine, set the mode to regression, and tune cost. Next I set up a tuning grid with ranges for the cost hyperparameter. Finally, I saved the tuning results to avoid refitting later.

svm_spec_b <- svm_rbf() %>%
  set_mode("regression") %>%
  set_engine("kernlab")

svm_wf_b <- workflow() %>%
  add_model(svm_spec_b %>% set_args(cost = tune())) %>%
  add_recipe(billboards_recipe)

param_grid_svm_b <- grid_regular(cost(), levels = 10)

tune_svm_b <- tune_grid(
  svm_wf_b, 
  resamples = billboards_folds, 
  grid = param_grid_svm_b
)

save(tune_svm_b, svm_wf_b, file = "data/model_fitting/tune_svm_b.rda")

LASSO REGRESSION

In this Lasso model we will use the glmnet engine, set mixture to 1 to indicate Lasso regularization, set the mode to regression, and tune penalty. Next I set up a tuning grid with ranges for the penalty hyperparameter. Finally, I saved the tuning results to avoid refitting later.

lasso_spec_b <- 
  linear_reg(penalty = tune(), mixture = 1) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

lasso_wf_b <- workflow() %>% 
  add_recipe(billboards_recipe) %>% 
  add_model(lasso_spec_b)

penalty_grid_lasso_b <- grid_regular(penalty(), levels = 10)

tune_lasso_b <- tune_grid(
  lasso_wf_b,
  resamples = billboards_folds, 
  grid = penalty_grid_lasso_b
)

save(tune_lasso_b, lasso_wf_b, file = "data/model_fitting/tune_lasso_b.rda")

Model Selection and Performance

Since we saved our models to avoid refitting, we must load them in the following steps.

load("data/model_fitting/tune_rf_b.rda")
load("data/model_fitting/tune_knn_b.rda")
load("data/model_fitting/tune_svm_b.rda")
load("data/model_fitting/tune_lasso_b.rda")

Random Forest Model

autoplot(tune_rf_b, metric = 'rmse')

show_best(tune_rf_b, metric = "rmse")
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     1     5    10 rmse    standard    29.4    10  0.0725 Preprocessor1_Model3…
## 2     1     4     8 rmse    standard    29.4    10  0.0714 Preprocessor1_Model2…
## 3     1     5     7 rmse    standard    29.4    10  0.0804 Preprocessor1_Model2…
## 4     1     4     7 rmse    standard    29.4    10  0.0596 Preprocessor1_Model2…
## 5     1     2     4 rmse    standard    29.4    10  0.0840 Preprocessor1_Model0…

From show_best(), the lowest mean RMSE is 29.36425, where mtry is 1, trees is 5, and min_n is 10. Since positions range from 1 to 100, an RMSE of ~29 is quite high, which may indicate that the AFTS are not great predictors for the Billboard data either. However, let’s look at the other models.
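To put an RMSE of ~29 in context, a useful baseline is the standard deviation of the outcome itself: a model that always predicted the mean position would score roughly that. A minimal sketch, assuming the `billboards_train` split from above:

```r
# RMSE of the trivial "always predict the mean" model. This equals the
# (population) standard deviation of position; a tuned model whose RMSE
# sits near this value is extracting little signal from the AFTS.
mean_pos <- mean(billboards_train$position)
sqrt(mean((billboards_train$position - mean_pos)^2))
```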

KKNN Model

autoplot(tune_knn_b, metric = 'rmse')

show_best(tune_knn_b, metric = "rmse")
## # A tibble: 5 × 7
##   neighbors .metric .estimator  mean     n std_err .config              
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        10 rmse    standard    32.1    10   0.148 Preprocessor1_Model10
## 2         9 rmse    standard    32.4    10   0.157 Preprocessor1_Model09
## 3         8 rmse    standard    32.7    10   0.169 Preprocessor1_Model08
## 4         7 rmse    standard    33.1    10   0.183 Preprocessor1_Model07
## 5         6 rmse    standard    33.6    10   0.196 Preprocessor1_Model06

From the show_best(), the lowest RMSE mean is 32.11814 where neighbors is 10. The RMSE is higher, so this nearest neighbors model performed worse.

SVM Model

autoplot(tune_svm_b, metric = 'rmse')

show_best(tune_svm_b, metric = "rmse")
## # A tibble: 5 × 7
##       cost .metric .estimator  mean     n std_err .config              
##      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.000977 rmse    standard    29.3    10  0.0644 Preprocessor1_Model01
## 2 0.00310  rmse    standard    29.4    10  0.0643 Preprocessor1_Model02
## 3 0.00984  rmse    standard    29.4    10  0.0644 Preprocessor1_Model03
## 4 0.0312   rmse    standard    29.5    10  0.0674 Preprocessor1_Model04
## 5 0.0992   rmse    standard    29.6    10  0.0784 Preprocessor1_Model05

From show_best(), the lowest mean RMSE is 29.34608, where cost is 0.0009765625. Note that this is the smallest cost in the grid (2^-10), so the tuning favors a heavily regularized, nearly flat model. Still, this RMSE is the lowest thus far, so this is the best model so far.

Lasso Model

autoplot(tune_lasso_b, metric = 'rmse')

show_best(tune_lasso_b, metric = "rmse")
## # A tibble: 5 × 7
##        penalty .metric .estimator  mean     n std_err .config              
##          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 1            rmse    standard    29.3    10   0.169 Preprocessor1_Model10
## 2 0.0774       rmse    standard    29.6    10   0.161 Preprocessor1_Model09
## 3 0.00599      rmse    standard    29.7    10   0.147 Preprocessor1_Model08
## 4 0.000464     rmse    standard    29.7    10   0.146 Preprocessor1_Model07
## 5 0.0000000001 rmse    standard    29.7    10   0.145 Preprocessor1_Model01

From the show_best(), the lowest RMSE mean is 29.34967 where penalty is 1.0. The RMSE is just a bit higher than our SVM model.

Thus we will continue with the SVM model in our finalized work flow.

Final Model Building

Our best performing model was the SVM! Next we will create a final workflow with the best SVM using select_best().

svm_wf_tuned_b <- svm_wf_b %>%
  finalize_workflow(select_best(tune_svm_b, metric = 'rmse'))
svm_b_fit <- fit(svm_wf_tuned_b, billboards_train)

Analysis of Test Set

  • SVM performance on test data results (RMSE and \(R^2\) values)
billboards_metric <- metric_set(rmse, rsq)
predictions <- predict(svm_b_fit, new_data = billboards_test) 
position <- billboards_test['position']
pos_dec <- billboards_test[c('position','decade')]
test <- billboards_test[c('songname','artistname', 'year','decade', 'position')]
model_test_predictions <- cbind(position, predictions)
mod_test_pred_info <- cbind(test, predictions)

#RMSE of Test
model_test_predictions %>% 
  billboards_metric(truth = position, estimate = .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    29.3    
## 2 rsq     standard     0.00440
  • Songs with the best predictions (smallest difference between actual and predicted values)
# adding a difference column
mod_test_pred_info['difference'] = (model_test_predictions$position 
- model_test_predictions$.pred)

tbl2a <-mod_test_pred_info %>% mutate(difference = abs(difference))

tbl2b <- tbl2a %>%
  filter(difference <= 10)

tbl2b %>%
  kable %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = '700px', height = "300px")
songname artistname year decade position .pred difference
Theme from The Apartment Ferrante & Teicher 1960 60s 53 51.38268 1.6173248
The Village of St. Bernadette Andy Williams 1960 60s 56 51.41583 4.5841744
I Love How You Love Me The Paris Sisters 1961 60s 52 51.54716 0.4528377
The Way You Look Tonight The Lettermen 1961 60s 58 51.48512 6.5148807
Crying in the Rain The Everly Brothers 1962 60s 47 50.90827 3.9082734
Shout! Shout! (Knock Yourself Out) Ernie Maresca 1962 60s 49 50.85376 1.8537558
Smoky Places The Corsairs 1962 60s 51 51.53550 0.5354984
Green Onions Booker T. & the M.G.’s 1962 60s 53 50.53243 2.4675685
Mean Woman Blues Roy Orbison 1963 60s 53 51.68449 1.3155117
Memphis Johnny Rivers 1964 60s 41 50.77643 9.7764317
White on White Danny Williams 1964 60s 42 50.68295 8.6829499
Um, Um, Um, Um, Um, Um Major Lance 1964 60s 47 50.50596 3.5059583
I’m Telling You Now Freddie and the Dreamers 1965 60s 42 50.87109 8.8710941
The Seventh Son Johnny Rivers 1965 60s 45 51.66070 6.6607016
Down in the Boondocks Billy Joe Royal 1965 60s 52 50.89762 1.1023760
Hooray for Hazel Tommy Roe 1966 60s 44 51.01911 7.0191095
We Can Work It Out The Beatles 1966 60s 49 51.94941 2.9494080
Mercy, Mercy, Mercy The Buckinghams 1967 60s 51 50.96612 0.0338780
(Your Love Keeps Lifting Me) Higher and Higher Jackie Wilson 1967 60s 53 50.81787 2.1821331
Georgy Girl The Seekers 1967 60s 57 51.53067 5.4693283
California Nights Lesley Gore 1967 60s 61 51.86240 9.1376004
Classical Gas Mason Williams 1968 60s 43 51.12956 8.1295621
(Sweet Sweet Baby) Since You’ve Been Gone Aretha Franklin 1968 60s 46 50.46195 4.4619486
Reach Out of the Darkness Friend & Lover 1968 60s 49 51.25209 2.2520852
Jumpin’ Jack Flash The Rolling Stones 1968 60s 50 51.61875 1.6187460
MacArthur Park Richard Harris 1968 60s 51 51.71059 0.7105897
Take Time to Know Her Percy Sledge 1968 60s 54 51.50976 2.4902437
I Got the Feelin’ James Brown 1968 60s 58 50.47618 7.5238154
I’ve Gotta Be Me Sammy Davis Jr.  1969 60s 51 51.14837 0.1483681
Runaway Child, Running Wild The Temptations 1969 60s 57 51.51656 5.4834351
Galveston Glen Campbell 1969 60s 59 51.33192 7.6680816
Reflections of My Life Marmalade 1970 70s 43 51.68454 8.6845421
Hey There Lonely Girl Eddie Holman 1970 70s 44 51.28802 7.2880174
He Ain’t Heavy, He’s My Brother The Hollies 1970 70s 46 51.54880 5.5488023
Come and Get It Badfinger 1970 70s 48 50.96576 2.9657611
Turn Back the Hands of Time Tyrone Davis 1970 70s 51 50.99237 0.0076285
In the Summertime Mungo Jerry 1970 70s 53 50.58570 2.4142979
Bridge over Troubled Water Aretha Franklin 1971 70s 52 51.06465 0.9353502
Draggin’ the Line Tommy James 1971 70s 54 51.01941 2.9805893
Stay Awhile The Bells 1971 70s 57 51.05334 5.9466606
Sweet City Woman The Stampeders 1971 70s 58 50.70168 7.2983220
If Bread 1971 70s 61 51.23475 9.7652470
Morning Has Broken Cat Stevens 1972 70s 44 51.33553 7.3355334
I Can See Clearly Now Johnny Nash 1972 70s 47 50.79851 3.7985071
Jungle Fever The Chakachas 1972 70s 51 51.60232 0.6023159
Where Is the Love Roberta Flack & Donny Hathaway 1972 70s 58 51.68236 6.3176354
Give Me Love (Give Me Peace on Earth) George Harrison 1973 70s 42 51.69420 9.6942045
Feelin’ Stronger Every Day Chicago 1973 70s 54 52.04427 1.9557290
I Believe in You (You Believe in Me) Johnnie Taylor 1973 70s 58 50.84781 7.1521935
Mockingbird Carly Simon & James Taylor 1974 70s 52 51.49158 0.5084160
Never, Never Gonna Give You Up Barry White 1974 70s 55 51.41254 3.5874571
Chevy Van Sammy Johns 1975 70s 48 51.82902 3.8290166
Shannon Henry Gross 1976 70s 47 51.35771 4.3577075
Devil Woman Cliff Richard 1976 70s 55 50.54941 4.4505875
Lucille Kenny Rogers 1977 70s 43 51.05996 8.0599637
Handy Man James Taylor 1977 70s 46 50.80201 4.8020135
I Wish Stevie Wonder 1977 70s 51 51.04956 0.0495606
After the Lovin’ Engelbert Humperdinck 1977 70s 61 51.70819 9.2918065
Love Will Find a Way Pablo Cruise 1978 70s 44 51.36557 7.3655687
Love Is in the Air John Paul Young 1978 70s 46 51.43842 5.4384207
Thunder Island Jay Ferguson 1978 70s 50 51.64689 1.6468871
Here You Come Again Dolly Parton 1978 70s 60 51.09256 8.9074401
She Believes in Me Kenny Rogers 1979 70s 47 51.83634 4.8363442
In the Navy Village People 1979 70s 48 51.27700 3.2770046
The Devil Went Down to Georgia The Charlie Daniels Band 1979 70s 50 51.41426 1.4142591
We Are Family Sister Sledge 1979 70s 53 50.38720 2.6127982
Boogie Wonderland Earth, Wind & Fire 1979 70s 57 51.02389 5.9761146
Stomp! The Brothers Johnson 1980 80s 46 50.70625 4.7062473
Emotional Rescue The Rolling Stones 1980 80s 53 50.67600 2.3239994
You’re Only Lonely J.D. Souther 1980 80s 57 51.44420 5.5558018
How ’Bout Us Champaign 1981 80s 45 51.34459 6.3445891
America Neil Diamond 1981 80s 62 52.01692 9.9830784
Do You Believe in Love Huey Lewis and the News 1982 80s 51 51.63732 0.6373187
Wasted on the Way Crosby, Stills & Nash 1982 80s 57 51.73079 5.2692146
One Hundred Ways Quincy Jones 1982 80s 61 51.10581 9.8941907
1999 Prince 1983 80s 41 50.41881 9.4188082
Tell Her About It Billy Joel 1983 80s 45 50.91198 5.9119822
Too Shy Kajagoogoo 1983 80s 50 50.92466 0.9246598
Don’t Let It End Styx 1983 80s 60 51.61601 8.3839912
Twist of Fate Olivia Newton-John 1984 80s 42 51.38148 9.3814798
Let the Music Play Shannon 1984 80s 49 51.12616 2.1261579
Almost Paradise Mike Reno and Ann Wilson 1984 80s 59 51.01184 7.9881551
Legs ZZ Top 1984 80s 60 50.84951 9.1504861
Freeway of Love Aretha Franklin 1985 80s 43 50.78185 7.7818460
You Give Good Love Whitney Houston 1985 80s 47 50.96137 3.9613722
Raspberry Beret Prince & the Revolution 1985 80s 51 51.07033 0.0703265
The Boys of Summer Don Henley 1985 80s 53 51.28324 1.7167620
If You Love Somebody Set Them Free Sting 1985 80s 55 51.41054 3.5894570
We Don’t Need Another Hero (Thunderdome) Tina Turner 1985 80s 57 50.50589 6.4941054
Material Girl Madonna 1985 80s 58 50.22680 7.7731987
Axel F Harold Faltermeyer 1985 80s 61 51.69450 9.3055015
Danger Zone Kenny Loggins 1986 80s 42 51.81582 9.8158170
If You Leave Orchestral Manoeuvres in the Dark 1986 80s 53 51.70089 1.2991066
Invisible Touch Genesis 1986 80s 54 51.44331 2.5566886
All Cried Out Lisa Lisa and Cult Jam 1986 80s 61 52.02600 8.9739961
In Too Deep Genesis 1987 80s 47 51.09929 4.0992932
Let’s Wait Awhile Janet Jackson 1987 80s 48 51.17102 3.1710246
Little Lies Fleetwood Mac 1987 80s 51 51.18694 0.1869414
Carrie Europe 1987 80s 56 51.71122 4.2887841
Together Forever Rick Astley 1988 80s 44 51.60969 7.6096915
Out of the Blue Debbie Gibson 1988 80s 54 51.34833 2.6516727
Don’t You Want Me Jody Watley 1988 80s 55 51.01066 3.9893418
I Get Weak Belinda Carlisle 1988 80s 57 51.79300 5.2069958
Girlfriend Pebbles 1988 80s 60 50.23531 9.7646889
Dirty Diana Michael Jackson 1988 80s 61 51.14492 9.8550782
Bust a Move Young MC 1989 80s 42 50.60671 8.6067140
So Alive Love and Rockets 1989 80s 51 51.53978 0.5397772
Here and Now Luther Vandross 1990 90s 43 51.07842 8.0784194
No More Lies Michel’le 1990 90s 50 50.26567 0.2656734
Do You Remember? Phil Collins 1990 90s 53 50.62888 2.3711174
Black Cat Janet Jackson 1990 90s 59 50.83133 8.1686742
Love of a Lifetime FireHouse 1991 90s 43 51.66302 8.6630199
Love Is a Wonderful Thing Michael Bolton 1991 90s 49 50.87552 1.8755210
Tom’s Diner DNA 1991 90s 53 50.58501 2.4149931
Make It Happen Mariah Carey 1992 90s 42 50.48891 8.4889072
Breakin’ My Heart (Pretty Brown Eyes) Mint Condition 1992 90s 48 51.19091 3.1909115
Mysterious Ways U2 1992 90s 57 50.92015 6.0798531
Hip Hop Hooray Naughty by Nature 1993 90s 45 51.14501 6.1450067
Will You Be There Michael Jackson 1993 90s 47 51.22222 4.2222166
All That She Wants Ace of Base 1993 90s 51 50.80370 0.1963001
7 Prince and The New Power Generation 1993 90s 52 50.68570 1.3143021
Here We Go Again! Portrait 1993 90s 59 51.28721 7.7127875
Loser Beck 1994 90s 50 50.84596 0.8459649
Never Lie Immature 1994 90s 53 50.70103 2.2989700
Understanding Xscape 1994 90s 58 51.05689 6.9431103
I’ll Be There for You Method Man 1995 90s 42 50.61011 8.6101091
Before I Let You Go Blackstreet 1995 90s 46 51.00220 5.0021953
I Wanna Be Down Brandy 1995 90s 49 50.66276 1.6627553
Hold On Jamie Walters 1995 90s 52 51.53297 0.4670262
Diggin’ on You TLC 1996 90s 45 50.97852 5.9785199
Elevators (Me & You) Outkast 1996 90s 59 51.02793 7.9720719
Mouth Merril Bainbridge 1997 90s 42 50.56495 8.5649482
Invisible Man 98 Degrees 1997 90s 45 51.26420 6.2641997
Get It Together 702 1997 90s 48 50.74516 2.7451593
It’s All Coming Back to Me Now Céline Dion 1997 90s 50 51.21927 1.2192731
Never Make a Promise Dru Hill 1997 90s 56 50.94679 5.0532125
Everyday Is a Winding Road Sheryl Crow 1997 90s 60 51.12416 8.8758406
I Will Come to You Hanson 1998 90s 50 51.69779 1.6977917
Swing My Way K. P. & Envyi 1998 90s 52 50.34689 1.6531100
The Arms of the One Who Loves You Xscape 1998 90s 53 51.23550 1.7644993
My Love Is the Shhh! Somethin’ for the People 1998 90s 54 50.41280 3.5872000
Lookin’ at Me Mase 1998 90s 59 50.98053 8.0194709
Looking Through Your Eyes LeAnn Rimes 1998 90s 60 51.09702 8.9029782
From This Moment On Shania Twain 1999 90s 57 51.74147 5.2585259
Never Let You Go Third Eye Blind 2000 200s 43 50.68836 7.6883573
My Love Is Your Love Whitney Houston 2000 200s 47 51.13782 4.1378191
Wifey Next 2000 200s 53 50.62591 2.3740912
Purest of Pain (A Puro Dolor) Son by Four 2000 200s 61 51.02738 9.9726154
This I Promise You ’N Sync 2001 200s 51 51.61780 0.6177985
Only Time Enya 2001 200s 59 51.48496 7.5150352
I Do!! Toya 2001 200s 60 50.57182 9.4281846
Down 4 U Ja Rule 2002 200s 44 51.10625 7.1062486
Can’t Get You Out of My Head Kylie Minogue 2002 200s 45 50.58152 5.5815183
More Than A Woman Aaliyah 2002 200s 58 50.60543 7.3945716
I Can Nas 2003 200s 58 50.81895 7.1810525
Splash Waterfalls Ludacris 2004 200s 42 51.05559 9.0555932
With You Jessica Simpson 2004 200s 50 51.11747 1.1174680
My Happy Ending Avril Lavigne 2004 200s 54 51.54583 2.4541705
Roses OutKast 2004 200s 56 50.72983 5.2701737
Photograph Nickelback 2005 200s 43 51.48917 8.4891656
Collide Howie Day 2005 200s 45 51.29150 6.2915044
Slow Down Bobby Valentino 2005 200s 47 51.29531 4.2953120
My Boo Usher and Alicia Keys 2005 200s 54 51.29793 2.7020699
Yo (Excuse Me Miss) Chris Brown 2006 200s 44 51.11746 7.1174554
Walk Away Kelly Clarkson 2006 200s 45 50.25134 5.2513367
Sexy Love Ne-Yo 2006 200s 51 50.74380 0.2562013
So What Field Mob 2006 200s 55 50.22706 4.7729391
Make Me Better Fabolous 2007 200s 44 51.30994 7.3099359
Waiting on the World to Change John Mayer 2007 200s 47 51.41487 4.4148652
Because of You Ne-Yo 2007 200s 57 50.26307 6.7369288
I Tried Bone Thugs-n-Harmony 2007 200s 58 51.09258 6.9074216
Shawty Plies 2007 200s 60 51.78543 8.2145658
Our Song Taylor Swift 2008 200s 41 50.81990 9.8199019
Damaged Danity Kane 2008 200s 42 50.33752 8.3375157
Sorry Buckcherry 2008 200s 44 51.46058 7.4605799
Independent Webbie 2008 200s 45 51.35510 6.3551001
Can’t Believe It T-Pain 2008 200s 46 51.46675 5.4667518
Like You’ll Never See Me Again Alicia Keys 2008 200s 47 51.69597 4.6959693
What You Got Colby O’Donis 2008 200s 51 51.02000 0.0199960
Sweetest Girl (Dollar Bill) Wyclef Jean 2008 200s 56 50.64075 5.3592479
Miss Independent Ne-Yo 2008 200s 57 50.69724 6.3027559
Obsessed Mariah Carey 2009 200s 41 50.93942 9.9394194
Love Lockdown Kanye West 2009 200s 44 51.31590 7.3158955
Gotta Be Somebody Nickelback 2009 200s 51 51.55736 0.5573562
Beautiful Akon 2009 200s 54 50.75364 3.2463580
Bulletproof La Roux 2010 200s 42 50.62884 8.6288413
Hard Rihanna 2010 200s 49 51.94305 2.9430544
Young Forever Jay-Z 2010 200s 50 51.30314 1.3031445
According to You Orianthi 2010 200s 55 50.95062 4.0493810
Over Drake 2010 200s 60 50.85600 9.1440038
Animal Neon Trees 2010 200s 61 51.44956 9.5504411
Dynamite Taio Cruz 2011 2010s 44 50.41759 6.4175857
Moment 4 Life Nicki Minaj 2011 2010s 50 51.55146 1.5514642
Just a Dream Nelly 2011 2010s 52 51.49288 0.5071200
Motivation Kelly Rowland 2011 2010s 53 51.09167 1.9083346
Jar of Hearts Christina Perri 2011 2010s 55 51.22755 3.7724481
Hold It Against Me Britney Spears 2011 2010s 60 51.10170 8.8983011
Feel So Close Calvin Harris 2012 2010s 42 50.56060 8.5605972
Domino Jessie J 2012 2010s 46 50.39532 4.3953195
Home Phillip Phillips 2012 2010s 49 51.42802 2.4280172
Not Over You Gavin DeGraw 2012 2010s 60 51.50651 8.4934947
Summertime Sadness Lana Del Rey and Cédric Gervais 2013 2010s 45 51.55113 6.5511293
I Need Your Love Calvin Harris 2013 2010s 56 50.86921 5.1307858
Some Nights Fun 2013 2010s 58 51.14364 6.8563565
Classic MKTO 2014 2010s 50 50.63430 0.6342982
My Hitta YG 2014 2010s 58 50.56621 7.4337922
Hey Brother Avicii 2014 2010s 60 51.36996 8.6300429
Somebody Natalie La Rose 2015 2010s 41 50.30690 9.3069000
Nasty Freestyle T-Wayne 2015 2010s 50 50.84530 0.8452978
I Don’t Mind Usher 2015 2010s 55 50.58818 4.4118223
Wildest Dreams Taylor Swift 2015 2010s 57 51.22130 5.7787006
You Know You Like It DJ Snake and AlunaGeorge 2015 2010s 59 51.20170 7.7982986
Uma Thurman Fall Out Boy 2015 2010s 60 50.81071 9.1892943
Like I’m Gonna Lose You Meghan Trainor 2016 2010s 42 51.22235 9.2223513
Let Me Love You DJ Snake 2016 2010s 47 51.35401 4.3540086
We Don’t Talk Anymore Charlie Puth 2016 2010s 50 50.84914 0.8491360
Hands to Myself Selena Gomez 2016 2010s 56 50.63782 5.3621810
2 Phones Kevin Gates 2016 2010s 57 50.74652 6.2534827
In the Night The Weeknd 2016 2010s 61 51.54271 9.4572922
Rockabye Clean Bandit 2017 2010s 44 50.47385 6.4738514
Feel It Still Portugal. The Man 2017 2010s 45 50.43923 5.4392301
Bank Account 21 Savage 2017 2010s 48 50.86221 2.8622120
Heathens Twenty One Pilots 2017 2010s 58 50.86191 7.1380935
Sicko Mode Travis Scott 2018 2010s 42 51.00392 9.0039197
Gucci Gang Lil Pump 2018 2010s 44 50.18350 6.1835008
Too Good at Goodbyes Sam Smith 2018 2010s 49 50.80896 1.8089618
Bodak Yellow Cardi B 2018 2010s 54 50.45163 3.5483720
Wolves Selena Gomez and Marshmello 2018 2010s 60 51.17621 8.8237948
Bartier Cardi Cardi B 2018 2010s 61 51.15730 9.8426979
Look Back at It A Boogie wit da Hoodie 2019 2010s 41 50.67069 9.6706910
A Lot 21 Savage 2019 2010s 42 51.17316 9.1731560
Mia Bad Bunny 2019 2010s 44 51.16021 7.1602115
Beautiful Crazy Luke Combs 2019 2010s 46 51.14334 5.1433434
Ritmo (Bad Boys for Life) Black Eyed Peas and J Balvin 2020 2020s 50 50.46130 0.4612958
Nobody but You Blake Shelton and Gwen Stefani 2020 2020s 52 52.01898 0.0189764
Truth Hurts Lizzo 2020 2020s 55 50.85850 4.1415024
Yummy Justin Bieber 2020 2020s 58 50.78857 7.2114323
Got What I Got Jason Aldean 2020 2020s 60 51.03851 8.9614851
Dynamite BTS 2021 2020s 41 50.60775 9.6077492
Beat Box SpotemGottem 2021 2020s 44 51.16275 7.1627465
Telepatía Kali Uchis 2021 2020s 49 51.21703 2.2170348
Bang! AJR 2021 2020s 56 50.63090 5.3690985
Essence Wizkids 2021 2020s 60 50.56096 9.4390447
Chasing After You Ryan Hurd and Maren Morris 2021 2020s 61 51.17999 9.8200114

In the first table here we can see that the RMSE on the test data was 29.33202. This is even slightly better than how the best SVM model performed on our training set, and still very close, so we did not overfit. Thus, our model performed reasonably well on the test data relative to the training data. Overall, however, the final model has a fairly high RMSE and a very low \(R^2\) value of 0.0044. Like the Grammy’s data, our model is not very predictive of the Billboard’s ranking. It seems to fit most values within the positions of 45 to 60. However, looking at the second table, we can see some songs where the difference between the predicted value and the actual value was less than or equal to 10. Only 244 of the 5027 test data points meet this criterion. However, maybe we can see if a certain decade has better prediction in the following plot.
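The metrics above can be sketched directly from a predictions table. This is a minimal illustration, assuming a tibble shaped like tbl2a with position, .pred, and difference columns; the toy values below are hypothetical and not the project’s actual predictions.

```r
library(dplyr)

# Hypothetical stand-in for the test-set predictions table (tbl2a)
preds <- tibble(
  position = c(5, 50, 95),   # actual Year-End chart position
  .pred    = c(48, 52, 55)   # model's predicted position
) %>%
  mutate(difference = abs(position - .pred))

# RMSE and R^2 computed by hand (yardstick::rmse() and rsq() agree)
rmse_val <- sqrt(mean((preds$position - preds$.pred)^2))
rsq_val  <- cor(preds$position, preds$.pred)^2

# How many predictions land within 10 positions of the truth?
n_close <- sum(preds$difference <= 10)
```

The same within-10 count, applied to the full test set, is how the 244-of-5027 figure above would be obtained.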

# using a similar plot from the example final project!
tbl2a %>% filter(difference >= 29.33) %>%
ggplot(aes(x = position, y = .pred)) +
  geom_abline(lty = 2) + geom_point(alpha = 1) +
  facet_wrap(~decade) +
  labs(
    title = "Test Data Set Predictions vs. Actual",
    subtitle = "Greater Than 29.33 Difference; 
    * Note: 2020s panel has less data overall due to the sample size of the decade",
    y = "Predicted Position on Billboard's Year-End Top 100",
    x = "Position on Billboard's Year-End Top 100"
  )


Since the 2020s panel contains too little data to judge, we can compare the other six decades. No single decade stands out in terms of points whose prediction error exceeds the RMSE of 29.33, so there are no outlying time periods. In sum, we can conclude that the Billboard’s data is also hard to model with AFTS. Perhaps popular music is unpredictable: what goes viral can be unexpected. In the conclusion, I will address further observations and reasons. For now, I will leave it at the fact that AFTS alone may not be good indicators of music popularity.

Conclusion

I wanted to explore, essentially, whether there is a ‘recipe’ for popularity. Are there certain audio features that make a song more likable? The intention of this project was to see if we could determine song popularity using Spotify Audio Features. Popularity was measured with two sources: The Recording Academy’s Grammy Song of the Year award and The Billboard’s Year-End Top 100. I was hoping to see that Spotify AFTS would be distinct enough to determine Grammy status and Billboard positioning. Since these sources are independent of each other, a best-fit model was found for each. Of the four classification models run to predict Grammy status, the SVM model had the highest AUC value of ~0.59. The other three models (Random Forest, Nearest Neighbors, and Lasso) all performed similarly, with AUC values that did not exceed 0.50. Overall, none of the models fit well, though the SVM model did perform marginally better. When fit to the Grammy’s test data, the SVM model performed worse than on the training set, with an AUC of ~0.49. Of the four regression models run to predict Billboard positions, the SVM model also had the lowest RMSE value: ~29.346. As with the Grammy’s data, the other models (Random Forest, Nearest Neighbors, and Lasso) performed similarly, with RMSE values all around 30. So, when applied to the Billboard’s test data, the SVM model returned an RMSE of ~29.33, very similar to the training data.

While the SVM models we selected performed similarly on their respective test data, they did not perform well. Looking first at the Grammy’s SVM model predictions on the test data, we were returned an AUC of ~0.49, which is essentially no better than random guessing. And as we saw from the confusion matrix, the model only wanted to categorize songs as ‘neither nominated nor won’. Looking at the Billboard’s SVM model predictions on the test data, we got an RMSE of ~29.33. As seen in the results table, most predictions ranged from 40 to 60 while the true values span 1 to 100. Despite being the best performers among the models we fit, the SVM predictions were not ideal.
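To see why an AUC near 0.5 means chance-level discrimination, consider a tiny toy example. This is only an illustration; the class labels and score column below are hypothetical, not the project’s actual Grammy data or predictions.

```r
library(yardstick)

# Toy scores where the positive class is barely separated from the negative
# class: exactly half of the positive/negative pairs are ranked correctly,
# so the AUC comes out at chance level (0.5)
toy <- data.frame(
  truth = factor(c("nom_or_won", "nom_or_won", "neither", "neither"),
                 levels = c("nom_or_won", "neither")),
  .pred_nom_or_won = c(0.7, 0.3, 0.6, 0.4)
)

roc_auc(toy, truth, .pred_nom_or_won)  # .estimate = 0.5
```

An AUC of ~0.49, as on our Grammy test set, sits right at this chance level: the model’s scores do not rank nominated/winning songs above the rest any better than a coin flip would.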

Considering we tested a variety of models and none of them performed very well, it may be fair to conclude that the AFTS are not sufficient predictor variables for music popularity. I think there are a few contributing factors. First, the way AFTS are measured may not entirely capture the individuality of a song. Maybe the way a song evokes emotion, or its technical qualities, cannot be captured entirely by Spotify’s machine learning analysis that determines the AFTS. Music theory, music production, and music cognition are highly complex topics that contribute to the popularity of music; a simple measurement may not accurately distinguish these concepts. Second, I suspect that what’s popular in a given year could be a bit unpredictable. From our EDA we found that some of the AFTS certainly vary decade to decade. However, within each decade the range of values had a wide spread. So, controlling for year, AFTS appear somewhat random. The more general notion I am alluding to is that what goes viral (makes the Billboard’s Top 100) or what is deemed critically acclaimed (nominated for a Grammy) is unpredictable in a given year. The landscape of music is fast-paced and constantly changing. The next hit could be a sound we’ve never heard. Who knows? Finally, my last thought is that the models we tested may not be complex enough for this data set. While we tested a wide range of models, it is possible that all of them are simply poor fits.

Exploring my love for music through machine learning has been incredibly fun and rewarding. I love working with the Spotify API, and I am glad I got to create an app that could extract these features. Furthermore, this exploration has sparked more questions. A year-to-year or decade-to-decade comparison of AFTS would be an interesting follow-up topic. There is no question that music has evolved over time, and it would be interesting to explore whether AFTS could predict what time period a song is from! Regardless of the success of these models, I look forward to continuing my investigation of the possible ways AFTS can be used.